GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested parallelism at runtime. However, the effective use of DP must still be understood: a naive use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naive use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.
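To make the "naive use of DP" concrete, the following is a minimal CUDA sketch (not code from the paper; kernel and parameter names are illustrative) in which each parent thread launches its own child grid for a nested amount of work known only at runtime. It requires a device of compute capability 3.5 or higher and compilation with relocatable device code (e.g. `nvcc -rdc=true`):

```cuda
// Child kernel: per-element work over one thread's nested work segment.
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel: naive Dynamic Parallelism. Each parent thread launches a
// separate child grid sized by its own runtime-dependent work count. Many
// such tiny launches incur per-launch overhead and often underutilize the
// GPU -- the behavior the proposed consolidation schemes aim to avoid by
// merging child work into fewer, larger grids.
__global__ void parent(float *data, const int *offsets, const int *counts) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int n = counts[t];  // degree of nested parallelism, known only at runtime
    if (n > 0)
        child<<<(n + 255) / 256, 256>>>(data + offsets[t], n);
}
```

Here `offsets` and `counts` are assumed to describe each thread's irregular work segment (as in an irregular nested loop); consolidation would instead aggregate these segments and launch one child grid covering all of them.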